R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Libraries downloading

library(httr)
library(dplyr)
## 
## Присоединяю пакет: 'dplyr'
## Следующие объекты скрыты от 'package:stats':
## 
##     filter, lag
## Следующие объекты скрыты от 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(readxl)
library(ggplot2)
library(vegan)
## Загрузка требуемого пакета: permute
## Загрузка требуемого пакета: lattice
## This is vegan 2.5-7
library(ggfortify)
library(plotly)
## 
## Присоединяю пакет: 'plotly'
## Следующий объект скрыт от 'package:ggplot2':
## 
##     last_plot
## Следующий объект скрыт от 'package:httr':
## 
##     config
## Следующий объект скрыт от 'package:stats':
## 
##     filter
## Следующий объект скрыт от 'package:graphics':
## 
##     layout
library(DESeq2)
## Загрузка требуемого пакета: S4Vectors
## Загрузка требуемого пакета: stats4
## Загрузка требуемого пакета: BiocGenerics
## 
## Присоединяю пакет: 'BiocGenerics'
## Следующие объекты скрыты от 'package:dplyr':
## 
##     combine, intersect, setdiff, union
## Следующие объекты скрыты от 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## Следующие объекты скрыты от 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which.max, which.min
## 
## Присоединяю пакет: 'S4Vectors'
## Следующий объект скрыт от 'package:plotly':
## 
##     rename
## Следующий объект скрыт от 'package:tidyr':
## 
##     expand
## Следующие объекты скрыты от 'package:dplyr':
## 
##     first, rename
## Следующие объекты скрыты от 'package:base':
## 
##     expand.grid, I, unname
## Загрузка требуемого пакета: IRanges
## 
## Присоединяю пакет: 'IRanges'
## Следующий объект скрыт от 'package:plotly':
## 
##     slice
## Следующие объекты скрыты от 'package:dplyr':
## 
##     collapse, desc, slice
## Следующий объект скрыт от 'package:grDevices':
## 
##     windows
## Загрузка требуемого пакета: GenomicRanges
## Загрузка требуемого пакета: GenomeInfoDb
## Загрузка требуемого пакета: SummarizedExperiment
## Загрузка требуемого пакета: MatrixGenerics
## Загрузка требуемого пакета: matrixStats
## 
## Присоединяю пакет: 'matrixStats'
## Следующий объект скрыт от 'package:dplyr':
## 
##     count
## 
## Присоединяю пакет: 'MatrixGenerics'
## Следующие объекты скрыты от 'package:matrixStats':
## 
##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
##     colWeightedMeans, colWeightedMedians, colWeightedSds,
##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
##     rowWeightedSds, rowWeightedVars
## Загрузка требуемого пакета: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## 
## Присоединяю пакет: 'Biobase'
## Следующий объект скрыт от 'package:MatrixGenerics':
## 
##     rowMedians
## Следующие объекты скрыты от 'package:matrixStats':
## 
##     anyMissing, rowMedians
## Следующий объект скрыт от 'package:httr':
## 
##     content
library(pheatmap)

0. Download data and describe it.

We are going to find potential genes responsible for Down syndrome development in mice model and test the dependency statistically.

Let’s download the data (was tricky):

#Make the link as a variable (for conviniency of reading)

url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/00342/Data_Cortex_Nuclear.xls'
GET(url, write_disk(TF <- tempfile(fileext = ".xls")))
## Response [https://archive.ics.uci.edu/ml/machine-learning-databases/00342/Data_Cortex_Nuclear.xls]
##   Date: 2022-02-18 06:46
##   Status: 200
##   Content-Type: application/x-httpd-php
##   Size: 1.63 MB
## <ON DISK>  C:\Users\sukha\AppData\Local\Temp\Rtmpkfm2bR\file114b07e65a27.xls
#Read data
data <- read_excel(TF)

Task 1. EDA analysis:

Now lets estimate following parametres: 1) Number of mice in the experiment; 2) Number of complete observations; 3) Which groups can be observed; 4) How balanced these groups are.

#Count mice' number:
data2 <- data %>% separate(MouseID, into = c('MouseID', 'rep'), sep = '_')
length(unique(data2$MouseID))
## [1] 72
#Count number of complete observations:
table(sapply(data[,2:78], complete.cases))[2]
##  TRUE 
## 81764
#Group by nominal varaiables and count thier volumes and means to analyze the balance* btw groups:
data_long <- data2 %>% pivot_longer(cols = c(3:79))
data_long <- data_long[complete.cases(data_long$value),]
data_long %>% group_by(Genotype, Treatment, Behavior, class) %>% na.omit() %>% summarise(total = n(), mean = mean(value), .groups = "keep")
## # A tibble: 8 x 6
## # Groups:   Genotype, Treatment, Behavior, class [8]
##   Genotype Treatment Behavior class  total  mean
##   <chr>    <chr>     <chr>    <chr>  <int> <dbl>
## 1 Control  Memantine C/S      c-CS-m 11340 0.692
## 2 Control  Memantine S/C      c-SC-m 11292 0.663
## 3 Control  Saline    C/S      c-CS-s 10196 0.703
## 4 Control  Saline    S/C      c-SC-s 10275 0.660
## 5 Ts65Dn   Memantine C/S      t-CS-m 10260 0.658
## 6 Ts65Dn   Memantine S/C      t-SC-m 10170 0.709
## 7 Ts65Dn   Saline    C/S      t-CS-s  8040 0.657
## 8 Ts65Dn   Saline    S/C      t-SC-s 10191 0.684
#data_l_qq$group <- paste0(data_l_qq$Genotype, data_l_qq$Treatment, data_l_qq$Behavior, data_l_qq$class, sep = '_')
#data_l_qq$group <- paste(data_l_qq$Genotype, data_l_qq$Treatment, data_l_qq$Behavior, data_l_qq$class, sep = '_')
#data_new <- data_l_qq
#data_new <- data_l_qq[,c(1,2,7,8,9,10)]
#g = unique(data_new$group

According to the analysis we see that there are only 72 mice, however the total number of complete observations is 81764.

To make clear which groups are presented data formatting was approached, concentrating on nominal dependent variables: Genotype, Treatment, Behaviour (or simply class variable, which includes division by mentioned three columns). In this case there are eight groups.

Looking at means and volumes of groups we can say that generally they are balanced, however one group significantly differs in volume which defines its weakened mean value (with more group’s observations its mean could extremely differ from the rest ones).

data_BDNF_N <- data_long[data_long$name=='BDNF_N',]
#To do statistical analysis we have to check the form of distributions - for this visualisation can help:

ggplot(data_BDNF_N, aes(x=value, group = class, fill =  class)) +
    geom_density(adjust=1.5, alpha = 0.4) +
    theme_bw() +
    facet_wrap(~class) +
    theme(
        legend.position="none",
        panel.spacing = unit(0.1, "lines"),
        axis.ticks.x=element_blank()
    )

To be more accurate statistical estimation should be provided - for this one of the method is to use shapiro test, which check if your variable has parametric or non-parametric ditridbution - for this the function is preferable to be created:

shapiro.test_pvalue<-function(x){shapiro.test(x)$p.value}
shapiro_data <- apply(X = data[,2:78], MARGIN = 2, FUN = shapiro.test_pvalue)

Task 2. BDNF analysis

Herein we see, that the majority of genes has non-parametric distributions, which not allows to test them with t.test or chisqueare-test. Neertheless, the statistics was gained on population - this can significantly hide deviations of subgroups (tested classes). Therefore, the shapiro test will be applied on each class to test subgroups ditributions:

shapiro_group <- function(vars, df, gene){
    out = as.data.frame(matrix(ncol = 1, nrow = length(vars)), row.names = vars)
    #colnames(out) = c('p,value', 'p.adjusted.holm', 'p.adjusted.BH')
    colnames(out) = 'shapiro_pv'
    df = df[df$name == gene,]
    for (i in vars){
        #col_n = paste0('p.value_',i)
        out[i,1] = shapiro.test_pvalue(subset(df,class==i)$value)
        #out[i,2] = p.adjust(out[i,1], method = 'holm')
        #out[i,3] = p.adjust(out[i,1], method = 'BH')
    }
    out$class = rownames(out)
    rownames(out) = c(1:nrow(out))
    out[,c(2,1)]
}
v = unique(data_long$class)
BDNF_N_class_shapiro  = shapiro_group(v, data_long, 'BDNF_N')
BDNF_N_class_shapiro[BDNF_N_class_shapiro$shapiro_pv<0.05,]
##    class   shapiro_pv
## 2 c-SC-m 7.044136e-04
## 5 t-CS-m 1.543850e-02
## 6 t-SC-m 4.634791e-03
## 7 t-CS-s 1.394412e-05

From all 8 classes for BDNF_N gene 4 have non-parametric distribution: c-SC-m,t-CS-m, t-SC-m, t-CS-m. This means that different dependencies in popultion could not be checked by one test - 4 of theme will be tested with parametric options, 4 others - by non-parametric ones.

Assign vectors of classes which will be tested by parametric or non-parametric method:

par_vars = BDNF_N_class_shapiro[BDNF_N_class_shapiro$shapiro_pv>0.05,1]
nonpar_vars = BDNF_N_class_shapiro[BDNF_N_class_shapiro$shapiro_pv<0.05,1]
#Make function for automatic statistical estimation (could be used for each gene and with parametric/non-parametric tests)
cor_test <- function(vars, df, gene, test){
    out = as.data.frame(matrix(ncol = 1, nrow = length(vars)), row.names = vars)
    #colnames(out) = c('p,value', 'p.adjusted.holm', 'p.adjusted.BH')
    colnames(out) = 'p.value'
    df = df[df$name == gene,]
    for (i in vars){
        col_n = paste0('p.value_',i)
        cur = df
        cur[,6] = with(cur, ifelse(class == i,1,0))
        if (test == 'parametric'){
            out[i,1] = t.test(cur$value~cur$class)$p.value
        } else {
            out[i,1] = wilcox.test(cur$value~cur$class)$p.value
        }
        #out[i,2] = p.adjust(out[i,1], method = 'holm')
        #out[i,3] = p.adjust(out[i,1], method = 'BH')
    }
    out$class = rownames(out)
    rownames(out) = c(1:nrow(out))
    out[,c(2,1)]
}

BDNF_N_class_t  = cor_test(v, data_long, 'BDNF_N', test = 'parametric')
BDNF_N_class_t[BDNF_N_class_t$p.value<0.05,]
##    class      p.value
## 1 c-CS-m 5.833146e-08
## 2 c-SC-m 1.660404e-17
## 3 c-CS-s 3.045948e-07
## 7 t-CS-s 1.084932e-03
BDNF_N_class_npar  = cor_test(nonpar_vars, data_long, 'BDNF_N', test = 'nonparametric')
BDNF_N_class_npar[BDNF_N_class_npar$p.value<0.05,]
##    class      p.value
## 1 c-SC-m 2.348884e-14
## 4 t-CS-s 2.001151e-04

We see that only four groups have significant correlation between expression and class type: c-CS-m, c-SC-m, c-CS-s, t-CS-s.

* - Peculiarly, if t.test is launched without division on parametric and non-parametric classes, the statistics observes the same groups as stristically significant for the hypothesis.

Besides manual method, the reggression approach is also suitable here. We can build manually the linear regression model or even use ANOVA test (though, someone could say that MANOVA is more preferable here).

Task 3.

Apparently, our data - gene expression - ought to be tested by linear model, as type of variable is continuous. But if we remember the results of distribution tyoe tests and look at values, we see that, firstly, the majority of our variables is NOT paametric and, secondly, values are not normilized (e.g. some genes have more than 3 times reater than ERBB4_N levels). Thus linear model on such a data would give invalid results and could not be reliable (see lower)

#Data preparation:
data_wo_na = na.omit(data2)

ERBB4_N_glm = glm(ERBB4_N~., data = data_wo_na[,-c(1,2,80:83)], family = 'binomial')
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
summary(ERBB4_N_glm)
## 
## Call:
## glm(formula = ERBB4_N ~ ., family = "binomial", data = data_wo_na[, 
##     -c(1, 2, 80:83)])
## 
## Deviance Residuals: 
##       Min         1Q     Median         3Q        Max  
## -0.061536  -0.009716  -0.000313   0.008226   0.087229  
## 
## Coefficients: (1 not defined because of singularities)
##                   Estimate Std. Error z value Pr(>|z|)
## (Intercept)     -2.7163061  3.9355287  -0.690    0.490
## DYRK1A_N        -0.0508526  3.4927882  -0.015    0.988
## ITSN1_N          0.0503768  4.9715398   0.010    0.992
## BDNF_N           0.2661558  8.9865405   0.030    0.976
## NR1_N           -0.1138142  3.1146131  -0.037    0.971
## NR2A_N          -0.0005766  0.6338903  -0.001    0.999
## pAKT_N           0.5168581 14.3101417   0.036    0.971
## pBRAF_N         -0.2985884 14.3516645  -0.021    0.983
## pCAMKII_N        0.0007856  0.3716956   0.002    0.998
## pCREB_N         -0.5008826 13.5372075  -0.037    0.970
## pELK_N           0.0011319  0.4846162   0.002    0.998
## pERK_N          -0.1501479  3.7150310  -0.040    0.968
## pJNK_N          -0.4272964 11.9274529  -0.036    0.971
## PKCA_N           0.0620690 12.7493974   0.005    0.996
## pMEK_N           0.0107050 12.7066200   0.001    0.999
## pNR1_N          -0.1869606  6.2013616  -0.030    0.976
## pNR2A_N          0.0777691  3.3858141   0.023    0.982
## pNR2B_N          0.1497077  3.1180292   0.048    0.962
## pPKCAB_N         0.0384360  1.2945537   0.030    0.976
## pRSK_N           0.0665511  6.6898413   0.010    0.992
## AKT_N            0.0149349  3.8136897   0.004    0.997
## BRAF_N           0.1346214  5.7559708   0.023    0.981
## CAMKII_N         0.0163273  9.4487542   0.002    0.999
## CREB_N          -0.0725953 14.5920046  -0.005    0.996
## ELK_N            0.0318107  1.7404469   0.018    0.985
## ERK_N            0.0340979  1.3140374   0.026    0.979
## GSK3B_N         -0.0365152  3.2085105  -0.011    0.991
## JNK_N           -0.1032776 15.8743539  -0.007    0.995
## MEK_N            0.0198319 10.2988985   0.002    0.998
## TRKA_N           0.0266700  7.7562855   0.003    0.997
## RSK_N           -0.1413711 18.6128726  -0.008    0.994
## APP_N           -0.0227248  8.6096226  -0.003    0.998
## Bcatenin_N       0.0132142  2.4632434   0.005    0.996
## SOD1_N          -0.0390579  1.8486544  -0.021    0.983
## MTOR_N           0.3399847  7.7485104   0.044    0.965
## P38_N           -0.1267767  6.3408707  -0.020    0.984
## pMTOR_N         -0.0784328  3.6370634  -0.022    0.983
## DSCR1_N         -0.0214529  4.8597573  -0.004    0.996
## AMPKA_N          0.2000331 10.7317051   0.019    0.985
## NR2B_N           0.0248844  4.2469942   0.006    0.995
## pNUMB_N         -0.0928735  9.4077742  -0.010    0.992
## RAPTOR_N         0.3791597 11.5622602   0.033    0.974
## TIAM1_N         -0.2615716  9.1740912  -0.029    0.977
## pP70S6_N         0.0177258  2.5012219   0.007    0.994
## NUMB_N          -0.4960149 17.9626789  -0.028    0.978
## P70S6_N         -0.0273069  2.3970905  -0.011    0.991
## pGSK3B_N         0.9892338 19.0306817   0.052    0.959
## pPKCG_N         -0.0496499  0.9145570  -0.054    0.957
## CDK5_N          -0.0120628  4.7926787  -0.003    0.998
## S6_N             0.1503379  3.6011249   0.042    0.967
## ADARB1_N        -0.0155827  0.9575418  -0.016    0.987
## AcetylH3K9_N    -0.0084418  5.1702531  -0.002    0.999
## RRP1_N          -0.1313356  4.9367349  -0.027    0.979
## BAX_N           -0.8748686 16.7247265  -0.052    0.958
## ARC_N            1.3196293 31.7762541   0.042    0.967
## nNOS_N          -0.0530179 13.4712403  -0.004    0.997
## Tau_N            0.5460545  8.4112484   0.065    0.948
## GFAP_N          -0.3741464 25.3918428  -0.015    0.988
## GluR3_N         -0.0182368 11.4773083  -0.002    0.999
## GluR4_N         -0.5659503 15.9436357  -0.035    0.972
## IL1B_N           0.1962498  5.0705538   0.039    0.969
## P3525_N          0.4290196 11.3826417   0.038    0.970
## pCASP9_N         0.0697727  1.4275073   0.049    0.961
## PSD95_N          0.1815490  1.7786663   0.102    0.919
## SNCA_N          -0.0487186 16.3893662  -0.003    0.998
## Ubiquitin_N      0.0165530  2.2583876   0.007    0.994
## pGSK3B_Tyr216_N  0.1980406  4.6672795   0.042    0.966
## SHH_N            0.3983810  9.3344732   0.043    0.966
## BAD_N           -0.2311394 16.5347535  -0.014    0.989
## BCL2_N          -0.0027127 14.5797976   0.000    1.000
## pS6_N                   NA         NA      NA       NA
## pCFOS_N         -0.0883528 14.6458087  -0.006    0.995
## SYP_N            0.1211354  5.0511457   0.024    0.981
## H3AcK18_N       -0.0049677 11.8331758   0.000    1.000
## EGR1_N          -0.0830582 12.3789058  -0.007    0.995
## H3MeK4_N         0.1190867 10.1203339   0.012    0.991
## CaNA_N          -0.0007549  1.6010933   0.000    1.000
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 0.92538  on 551  degrees of freedom
## Residual deviance: 0.12542  on 476  degrees of freedom
## AIC: 339.24
## 
## Number of Fisher Scoring iterations: 5

The quality of our model is described by following parameters: Null deviance, Residual deviance, AIC.

As we can see if only ERBB4_N is taken into account for its expression prediction, we have 92% of accuracy. However, if we consider other genes as predictors - the accuracy decreases till 12.5%, which signifies about poor model. If we even calculate p-value of Chi-square statistics (Null-Resildual deviances with 75 df [null df - res df]), we will come up with p-value=1.0000, which once more tells us about the bad quality of the model. And slightly high AIC metric also proves the previous statement.

In comparison, linear model summary indicates about high quality of model as multiple R^2 (the proportion of response explained by predictors) is pretty high, 86% (even with number of observation correction, adjusted R^2). And overall p-value of model is far away from 0.05.

Task 4: PCA analysis

To define division of our dataset, possible clusterization of genes into significant/not significant groups and uniqly deviate genes (probably, associated with disease) PCA method will be used. It simply shows the overall structure of dataset (herein only expression data will be analyzed) and make it clear associations of expression data and groups.

#Make rda model and define on what component our data is expained by more than 70 %.

genes_rda = rda(data_wo_na[,-c(1:2,80:83)], scale = T)
screeplot(genes_rda, type = "lines", bstick = TRUE)

summ_rda = summary(genes_rda)

perc_count = 0
i = 1
while (perc_count <= 0.70){
    perc_count = perc_count + summ_rda$cont$importance[2,i]
    i =  i + 1
}
cat("Number of PCA, explaining at least 70 % of ditribution: ", i,"\n","Explained percent: ", perc_count)
## Number of PCA, explaining at least 70 % of ditribution:  6 
##  Explained percent:  0.7069859

We can see that from 6th component our dataset is eplained by more than 70%. However, to define which exact components represent our dataset division into groups, pairwise vizualisation of components should be provided. It could be approached as wuth basic R packages (biplot and data extraction) as with extensions as ggfortify and pca tool prcomp.

PCA visualization and factor analysis provide an opportunity to determine more significant genes by factors’ loadings and their vector of association. To achieve not only vectors but also values - prcomp function should implemented (princomp will also define vector with its strength).

PCA model construction and factor analysis:

pca_fortify = prcomp(data_wo_na[,-c(1:2,80:83)], scale = T)

autoplot(pca_fortify, data = data_wo_na[,-c(1:2,80:83)], loadings = T, loadings.label.size = 2, loadings.label = T, loadings.colour = 'lightseagreen', loadings.label.colour = 'violetred', alpha = 0)

#components<-as.data.frame(scores(genes_rda, display = "species", choices = c(1:6), scaling = 0))
#components

As we could see, in the case of our dataset there’re slightly defined groups of genes, explaining more 1st or 2nd components, and the separate one, which shows neutral association with any of PCs. Lets see, which genes better explains PC1 and PC2.

#Determine qunatiles of genes' distances from PC1:

pc_g_contib = as.data.frame(scores(genes_rda, display = 'species', choices = c(1, 2), scaling = 'species', correlation = TRUE))

quantile(abs(pc_g_contib$PC1))
##          0%         25%         50%         75%        100% 
## 0.007753728 0.138828554 0.308939994 0.423944067 0.557217480
quantile(abs(pc_g_contib$PC2))
##          0%         25%         50%         75%        100% 
## 0.002286557 0.120173042 0.191688275 0.326803569 0.486396614
#Show genes, lying from Q3 to Q4.

Herein, the absolute values of genes significance in the case of each component are higher than: 0.42 and 0.32, consequently. Next to determine which genes contibute mostly into PC1 - those with absolute factor loadings >= 0.42 and < 0.32 should be chosen. And for PC2 - with factor loadings >= 0.32 and < 0.42, consequently.

#Extract PC1 significant genesL
pc1_signif = pc_g_contib %>% abs() %>% subset(PC1 >= 0.42 & PC2 < 0.32)

##And their indeed values:
pc_g_contib[rownames(pc_g_contib) %in% rownames(pc1_signif),]
##                   PC1          PC2
## BDNF_N     -0.5397593 -0.123978495
## NR1_N      -0.5437898 -0.108054034
## NR2A_N     -0.4920987 -0.138413219
## pCREB_N    -0.5266473  0.037344855
## pJNK_N     -0.4940580  0.205497447
## PKCA_N     -0.4498513 -0.160783225
## pMEK_N     -0.4793052  0.247121967
## pNR1_N     -0.5120032 -0.044891064
## pNR2B_N    -0.5271539 -0.030113018
## AKT_N      -0.4611025  0.132536331
## CAMKII_N   -0.4505762  0.210133626
## ELK_N      -0.4757119 -0.175576529
## ERK_N      -0.4827344 -0.251857691
## JNK_N      -0.4970278  0.036329848
## MEK_N      -0.5545068  0.021968889
## TRKA_N     -0.5572175 -0.157307657
## APP_N      -0.4239441 -0.159792864
## Bcatenin_N -0.5320550 -0.120173042
## AMPKA_N    -0.4288820  0.039036925
## SYP_N      -0.4348027 -0.002286557
#Extract PC2-significant genes:
pc_g_contib %>% abs() %>% subset(PC2 >= 0.32 & PC1 < 0.42, select = PC2)
##                   PC2
## DYRK1A_N    0.4219015
## ITSN1_N     0.4298712
## pAKT_N      0.3268036
## pERK_N      0.4416798
## pPKCAB_N    0.3484183
## BRAF_N      0.3562907
## GSK3B_N     0.4154438
## SOD1_N      0.3869171
## P38_N       0.4631735
## NUMB_N      0.3354952
## S6_N        0.3224426
## ARC_N       0.3561059
## IL1B_N      0.4041782
## SNCA_N      0.4863966
## Ubiquitin_N 0.3737999
## BAD_N       0.3273335
## BCL2_N      0.4070909
## pS6_N       0.3561059
## EGR1_N      0.4655561
## H3MeK4_N    0.3990472
## CaNA_N      0.4477004
##And their indeed values
pc_g_contib[rownames(pc_g_contib) %in% rownames(pc1_signif),]
##                   PC1          PC2
## BDNF_N     -0.5397593 -0.123978495
## NR1_N      -0.5437898 -0.108054034
## NR2A_N     -0.4920987 -0.138413219
## pCREB_N    -0.5266473  0.037344855
## pJNK_N     -0.4940580  0.205497447
## PKCA_N     -0.4498513 -0.160783225
## pMEK_N     -0.4793052  0.247121967
## pNR1_N     -0.5120032 -0.044891064
## pNR2B_N    -0.5271539 -0.030113018
## AKT_N      -0.4611025  0.132536331
## CAMKII_N   -0.4505762  0.210133626
## ELK_N      -0.4757119 -0.175576529
## ERK_N      -0.4827344 -0.251857691
## JNK_N      -0.4970278  0.036329848
## MEK_N      -0.5545068  0.021968889
## TRKA_N     -0.5572175 -0.157307657
## APP_N      -0.4239441 -0.159792864
## Bcatenin_N -0.5320550 -0.120173042
## AMPKA_N    -0.4288820  0.039036925
## SYP_N      -0.4348027 -0.002286557

Hence, in our data we see that there are two groups specific to PC1 and PC2 components and others two, which have less strong association to components and our distribution.

Remembering rda summary we can notice that even first two components have weak contribution into our distribution - olky 46 %. Which increases by 0.1 % with PC3 (meaning it has a weak role).

summ_rda$cont$importance[,1:6]
##                              PC1        PC2        PC3        PC4        PC5
## Eigenvalue            23.0487691 13.3587302 7.58645979 7.16154040 3.28241688
## Proportion Explained   0.2993347  0.1734900 0.09852545 0.09300702 0.04262879
## Cumulative Proportion  0.2993347  0.4728247 0.57135012 0.66435714 0.70698593
##                             PC6
## Eigenvalue            3.0423396
## Proportion Explained  0.0395109
## Cumulative Proportion 0.7464968

To see the clusterization of our data comparing with existing groups 3D visualization should be obtained (for explaination and taking into account the most representative data).

For this plotly package serves as a good method. It’s considered that we operate with 8 separate classes. However, gene contribution analysis revealed only two specific groups and two unsoecified ones. This seems peculiar and the cross-validation should be obtained.

#Prepare data for only PC1-3 visualization:
pc3_data = prcomp(data_wo_na[,-c(1:2,80:83)], rank. = 3, scale = T)

#Make dataset of loadings including existing classification of samples
##Necessary for valid clusterization
components <- as.data.frame(pc3_data[["x"]])
components$PC2 <- -components$PC2 #necessary for dimension correct vis.
components$PC3 <- -components$PC3 #the same as previous
components <- cbind(components, data_wo_na$class)

#Count the proportion of expanation of our distribution by 3 components
tot_explained_variance_ratio <- summary(pc3_data)[["importance"]]['Proportion of Variance',1:3]
tot_explained_variance_ratio <- 100 * sum(tot_explained_variance_ratio)

tit = 'Total Explained Variance = 57.135'
fig_pc <- plot_ly(components, x = ~PC1, y = ~PC2, z = ~PC3, color = ~data_wo_na$class) %>%
 add_markers(size = 12)

fig_pc <- fig_pc %>%
 layout(
     title = tit,
     scene = list(bgcolor = "#e5ecf6")
 )
fig_pc

Task 5. Differential expression analysis

To conduct DESeq analysis data preparation should be implemented. Differential expression tests take two datasets for the analysis: expression levels in read resolution or count matrix in the case of samples/genes and description of sample-groups (wt, mut).

#Count matrix conversion:

count_df = pivot_longer(data_wo_na[,-c(1,2,80:82)], cols = -c(78))

#Get the expression data for each gene from all repetitions:
count_df_sum <- aggregate(formula = value ~ name + class, data = count_df, FUN = function (x) sum(x)*100) %>% pivot_wider(, id_cols = name, names_from = class, values_from = value)
count_df$name = gsub(x = count_df$name,pattern = '_N',replacement = '')
count_df2 = count_df %>% group_by(name, class) %>% mutate(idx = row_number()) %>% mutate(gene_rep = paste(name, idx, sep = '_')) %>% ungroup() %>% select(class, gene_rep, value)

#Transform to matrix format
count_df3 <- count_df2 %>% pivot_wider(id_cols = gene_rep, names_from = class, values_from = value) %>% as.data.frame()

#Change NA and type of values to integer where necessary:
count_wo_na = count_df3 %>% replace(is.na(.), 0)
count_wo_na = as.data.frame(mutate_if(count_wo_na,is.numeric, as.integer))

#Description data
meta_df = as.data.frame(unique(data_wo_na[,80:83]))
#See diff. expr in groups of genotype: control vs disease
dds_geno = DESeqDataSetFromMatrix(countData=count_wo_na, colData=meta_df, design=~Genotype, tidy = T)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
dds_g = DESeq(dds_geno)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## -- note: fitType='parametric', but the dispersion trend was not well captured by the
##    function: y = a/x + b, and a local regression fit was automatically substituted.
##    specify fitType='local' or 'mean' to avoid this message next time.
## final dispersion estimates
## fitting model and testing
dds_res_g = na.omit(results(dds_g,tidy = T))
na.omit(results(dds_g))
## log2 fold change (MLE): Genotype Ts65Dn vs Control 
## Wald test p-value: Genotype Ts65Dn vs Control 
## DataFrame with 1654 rows and 6 columns
##               baseMean log2FoldChange     lfcSE       stat    pvalue      padj
##              <numeric>      <numeric> <numeric>  <numeric> <numeric> <numeric>
## ITSN1_1       0.130433      0.7408580  3.533812  0.2096484  0.833942   0.99657
## NR1_1         2.013787      0.0195626  1.091389  0.0179245  0.985699   0.99657
## NR2A_1        4.020681     -0.3447755  0.898006 -0.3839347  0.701027   0.99657
## pCAMKII_1     3.640248     -0.2839704  0.930513 -0.3051761  0.760232   0.99657
## pELK_1        1.010866      0.0264464  1.640379  0.0161222  0.987137   0.99657
## ...                ...            ...       ...        ...       ...       ...
## ADARB1_90     0.130433       0.740858   3.53381   0.209648  0.833942   0.99657
## pCASP9_90     0.130433       0.740858   3.53381   0.209648  0.833942   0.99657
## PSD95_90      0.260866       1.485539   3.53099   0.420714  0.673964   0.99657
## Ubiquitin_90  0.130433       0.740858   3.53381   0.209648  0.833942   0.99657
## CaNA_90       0.130433       0.740858   3.53381   0.209648  0.833942   0.99657
#See diff. expr in groups of treatment: Saline vs Memantine
dds_tr = DESeqDataSetFromMatrix(countData=count_wo_na, colData=meta_df, design=~Treatment, tidy = T)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
dds_t = DESeq(dds_tr)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## -- note: fitType='parametric', but the dispersion trend was not well captured by the
##    function: y = a/x + b, and a local regression fit was automatically substituted.
##    specify fitType='local' or 'mean' to avoid this message next time.
## final dispersion estimates
## fitting model and testing
dds_res_t = na.omit(results(dds_t,tidy = T))
na.omit(results(dds_t))
## log2 fold change (MLE): Treatment Saline vs Memantine 
## Wald test p-value: Treatment Saline vs Memantine 
## DataFrame with 1654 rows and 6 columns
##               baseMean log2FoldChange     lfcSE       stat    pvalue      padj
##              <numeric>      <numeric> <numeric>  <numeric> <numeric> <numeric>
## ITSN1_1       0.130433    -0.73250396  3.533812 -0.2072844  0.835788  0.999643
## NR1_1         2.013787    -0.01112341  0.958196 -0.0116087  0.990738  0.999643
## NR2A_1        4.020681    -0.00990328  0.681287 -0.0145361  0.988402  0.999643
## pCAMKII_1     3.640248    -0.10767608  0.720797 -0.1493848  0.881250  0.999643
## pELK_1        1.010866    -0.75663032  1.664259 -0.4546349  0.649372  0.999643
## ...                ...            ...       ...        ...       ...       ...
## ADARB1_90     0.130433      -0.732504   3.53381  -0.207284  0.835788  0.999643
## pCASP9_90     0.130433      -0.732504   3.53381  -0.207284  0.835788  0.999643
## PSD95_90      0.260866      -1.479536   3.53071  -0.419048  0.675181  0.999643
## Ubiquitin_90  0.130433      -0.732504   3.53381  -0.207284  0.835788  0.999643
## CaNA_90       0.130433      -0.732504   3.53381  -0.207284  0.835788  0.999643
#See diff. expr in groups of behavior:C/S vs S/C
dds_bh = DESeqDataSetFromMatrix(countData=count_wo_na, colData=meta_df, design=~Behavior, tidy = T)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
##   Note: levels of factors in the design contain characters other than
##   letters, numbers, '_' and '.'. It is recommended (but not required) to use
##   only letters, numbers, and delimiters '_' or '.', as these are safe characters
##   for column names in R. [This is a message, not a warning or an error]
dds_b = DESeq(dds_bh)
## estimating size factors
##   Note: levels of factors in the design contain characters other than
##   letters, numbers, '_' and '.'. It is recommended (but not required) to use
##   only letters, numbers, and delimiters '_' or '.', as these are safe characters
##   for column names in R. [This is a message, not a warning or an error]
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## -- note: fitType='parametric', but the dispersion trend was not well captured by the
##    function: y = a/x + b, and a local regression fit was automatically substituted.
##    specify fitType='local' or 'mean' to avoid this message next time.
##   Note: levels of factors in the design contain characters other than
##   letters, numbers, '_' and '.'. It is recommended (but not required) to use
##   only letters, numbers, and delimiters '_' or '.', as these are safe characters
##   for column names in R. [This is a message, not a warning or an error]
## final dispersion estimates
##   Note: levels of factors in the design contain characters other than
##   letters, numbers, '_' and '.'. It is recommended (but not required) to use
##   only letters, numbers, and delimiters '_' or '.', as these are safe characters
##   for column names in R. [This is a message, not a warning or an error]
## fitting model and testing
dds_res_b = na.omit(results(dds_b,tidy = T))
na.omit(results(dds_b))
## log2 fold change (MLE): Behavior S.C vs C.S 
## Wald test p-value: Behavior S.C vs C.S 
## DataFrame with 1654 rows and 6 columns
##               baseMean log2FoldChange     lfcSE        stat    pvalue      padj
##              <numeric>      <numeric> <numeric>   <numeric> <numeric> <numeric>
## ITSN1_1       0.130433      -0.732504  3.091069 -0.23697417  0.812677  0.999317
## NR1_1         2.013787      -0.011199  1.141095 -0.00981425  0.992169  0.999317
## NR2A_1        4.020681      -0.190931  0.559356 -0.34134093  0.732847  0.999317
## pCAMKII_1     3.640248       0.492177  0.618472  0.79579484  0.426151  0.999317
## pELK_1        1.010866      -0.757598  1.839689 -0.41180770  0.680480  0.999317
## ...                ...            ...       ...         ...       ...       ...
## ADARB1_90     0.130433      -0.732504   3.09107   -0.236974  0.812677  0.999317
## pCASP9_90     0.130433      -0.732504   3.09107   -0.236974  0.812677  0.999317
## PSD95_90      0.260866      -1.479536   3.53071   -0.419048  0.675181  0.999317
## Ubiquitin_90  0.130433      -0.732504   3.09107   -0.236974  0.812677  0.999317
## CaNA_90       0.130433      -0.732504   3.09107   -0.236974  0.812677  0.999317
##Check p-values in all deseq sets:

as.data.frame(summary(dds_res_t)[,6])
##   summary(dds_res_t)[, 6]
## 1       Min.   :0.03282  
## 2       1st Qu.:0.65651  
## 3       Median :0.82666  
## 4       Mean   :0.74374  
## 5       3rd Qu.:0.84073  
## 6       Max.   :0.99964
as.data.frame(summary(dds_res_b)[,6])
##   summary(dds_res_b)[, 6]
## 1       Min.   :0.05536  
## 2       1st Qu.:0.68340  
## 3       Median :0.81369  
## 4       Mean   :0.78629  
## 5       3rd Qu.:0.86286  
## 6       Max.   :0.99932
as.data.frame(summary(dds_res_g)[,6])
##   summary(dds_res_g)[, 6]
## 1        Min.   :0.1599  
## 2        1st Qu.:0.6771  
## 3        Median :0.8258  
## 4        Mean   :0.7880  
## 5        3rd Qu.:0.8426  
## 6        Max.   :0.9966

According to the results we could admit that in variable-specific datasets the expression of not all genes is affected by the conditions.

For instance, pvalue statistics tells that only in the case of treatment significant expression variance can be noticed. It was shown that the significant expression variance was obtained in the case of pCAMKII_N_63. This gene has two times increased expression (LFC = 2.4), signifying about its contribution under Memantine treatment.

However genotype-specific expression rate of this gene didn’t change significantly (only LFC = 0.8).

Task 6. Data visualization with MCA and shrinkage

# Data preparation:
dds_res_t2 = results(dds_t, name = 'Treatment_Saline_vs_Memantine')
dds_res_g2 = results(dds_g, name = 'Genotype_Ts65Dn_vs_Control')
dds_res_b2 = results(dds_b, name = 'Behavior_S.C_vs_C.S')

res_t_LFC = lfcShrink(dds_t, coef = 'Treatment_Saline_vs_Memantine',  type="apeglm")
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
##     Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
##     sequence count data: removing the noise and preserving large differences.
##     Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
res_g_LFC = lfcShrink(dds_g, coef = 'Genotype_Ts65Dn_vs_Control',  type="apeglm")
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
##     Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
##     sequence count data: removing the noise and preserving large differences.
##     Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
res_b_LFC = lfcShrink(dds_b, coef = 'Behavior_S.C_vs_C.S',  type="apeglm")
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
##     Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
##     sequence count data: removing the noise and preserving large differences.
##     Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
# Data visualization
plotMA(res_t_LFC, main = 'LFC of genes in treatments groups')

plotMA(res_g_LFC, main = 'LFC of genes in genotype groups')

plotMA(res_b_LFC, main = 'LFC of genes in behavior groups')

To analyze the significance of our expression data it’s better to check shrinkage LFC, being more sensitive to gene ranks (~principle components they explain). Indeed, we see that no significant deviations of gene expressions can be described. Neither in treatment, nor genotype and behavior groups.

But that do not allow to reject any association between division groups. For the latter heatmap visualization could be obtained.

#Data preparation and visualization (seems to be the same, as equal dependencies are checked):

select_t <- order(rowMeans(counts(dds_t,normalized=TRUE)),decreasing=TRUE)[1:20]
df_t <- as.data.frame(colData(dds_t)[,c("Treatment", "Genotype", "Behavior")])
ntd_t <- normTransform(dds_t)
pheatmap(assay(ntd_t)[select_t,], cluster_rows=FALSE, show_rownames=FALSE,
         cluster_cols=FALSE, annotation_col=df_t, main = 'Heatmap of treatment data')

select_g <- order(rowMeans(counts(dds_g,normalized=TRUE)),decreasing=TRUE)[1:20]
df_g <- as.data.frame(colData(dds_g)[,c("Genotype", "Treatment", "Behavior")])
ntd_g <- normTransform(dds_g)
pheatmap(assay(ntd_g)[select_g,], cluster_rows=FALSE, show_rownames=FALSE,
         cluster_cols=FALSE, annotation_col=df_g, main = 'Heatmap of genotype data')

select_b <- order(rowMeans(counts(dds_b,normalized=TRUE)),decreasing=TRUE)[1:20]
df_b <- as.data.frame(colData(dds_b)[,c("Behavior", "Genotype", "Treatment")])
ntd_b <- normTransform(dds_b)
pheatmap(assay(ntd_b)[select_b,], cluster_rows=FALSE, show_rownames=FALSE,
         cluster_cols=FALSE, annotation_col=df_b, main = 'Heatmap of behavior data')

Here we see, that there is a significant correlation of gene expression when mice from control groups were treated with saline. They possess higher gene expression, comparing with mean one. In comparison, mice from disease-model group in most cases have lower gene expression rates, which, intriguingly, decrease after memantine treatment.

Conclusion

Overall, we can admit that some association could be recognised. And, definetly, memantine treatment decrease level of some genes under disease conditions. However, more precise information is necessary and the better quality, data representation, group distribution and volume should be obtained as with such results none significant differencial expression changings can be noticed. And group division is not accurate.